Add some additional information to customize the knitted document:

date: "September 30, 2020"
output:
  html_document:
    number_sections: yes
    theme: cerulean
    toc: yes
    toc_depth: 5
    toc_float: yes
  pdf_document:
    toc: yes
    toc_depth: '5'

This adds a table of contents (toc) and changes the color scheme (theme: cerulean).

To find your favorite Rmarkdown theme: https://www.datadreaming.org/post/r-markdown-theme-gallery/

knitr::opts_chunk$set(cache=TRUE, fig.path='figures/', fig.width=8, fig.height=5 )

This caches chunk results (cache=TRUE), saves all figures in the directory figures/, and sets the default figure size in inches.

1 R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Rmarkdown Cheatsheet: https://rmarkdown.rstudio.com/lesson-15.html

“#” hash signs indicate headers.

The number of hashes equals the header level.

1.1 h2

1.1.1 h3

1.1.2 h4

placing a single asterisk on either side of a phrase makes it italic.

double asterisks make a word or phrase bold.

triple asterisks make a word or phrase bold and italic.

  • a single asterisk at the beginning of a line makes a bullet
  1. and a number at the beginning of a line creates a numbered item.
  2. this should add

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Execute this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

1.2 Including Plots & Images

You can also embed plots, for example:

(Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.)

echo=FALSE will display only the output, not the code.
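A minimal sketch of such a chunk body, assuming it is the pressure chunk from RStudio's default R Markdown template (the chunk name appears in these notes; the body is an assumption):

```r
# Scatter plot of the built-in `pressure` data set
# (vapor pressure of mercury as a function of temperature)
plot(pressure)
```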

Some more chunk options:

  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).
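These options can also be set once for the whole document instead of chunk by chunk. A small sketch using knitr::opts_chunk (the particular values are only an illustration, not a recommendation):

```r
library(knitr)

# Set defaults for every chunk that follows; an individual chunk
# can still override them in its own header.
opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

# Inspect one of the current defaults
opts_chunk$get("echo")
```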

Naming chunks is good practice (the plot chunk above was named pressure):

  • names help you navigate around the document
  • figure files are named after the chunk that produced them

(check the Rproject directory after knitting)

You can also include images from your local computer or from the web with the Markdown image syntax ![caption](path-or-URL).

1.3 Adding tables

Can type out tables:

col name
1 1 1
2 2 2

Alternatively, you can use the knitr package to make Markdown tables from data frames:

speed dist
4 2
4 10
7 4
7 22
8 16
9 10

Columns can be left-, right-, or center-aligned, e.g. with the align argument of kable().
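A short sketch of how such a table could be produced with knitr::kable(), using the built-in cars data (the alignment chosen here is only an example):

```r
library(knitr)

# First rows of the built-in `cars` data as a Markdown table;
# align = c("l", "r") left-aligns speed and right-aligns dist
tbl <- kable(head(cars), align = c("l", "r"))
tbl
```

Use "c" in the align vector to center a column.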

1.4 Knitting

When you knit the file, an HTML file containing the code and output will be saved alongside it (click the Knit button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor (Viewer tab).

2 Rprojects

Rproject Benefits:

  • No need to set the working directory. All paths are relative to the directory containing the Rproject.

    Whenever you open your project, the working directory is automatically set to where your project is. This means your code will not break when you work on a different computer.

  • RStudio projects allow you to open multiple projects at the same time with each open to its own project directory. This allows you to keep multiple projects open without them interfering with each other.
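In practice this means file paths can be written relative to the project root. A minimal sketch, assuming a data directory like the one created later in these notes:

```r
# Build a path relative to the project root; no setwd() needed
data_path <- file.path("data", "gapminder_data.csv")
data_path

# When the project is opened through its .Rproj file,
# the working directory is the project directory
getwd()
```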

Good organization / project layout will:

  • ensure the integrity of your data
  • make it easier to collaborate
  • make it easier to pick a project back up after a break

Project Management tips:

  • treat raw data as “read only”
  • create a separate directory for “cleaned data”, e.g. results (or don’t save altered data files at all; we will see how later on with dplyr)
  • generated output is disposable (because your analysis is reproducible!)
  • put scripts in src directory
  • name all files to reflect their content or function (e.g. fig1_pca_communitycomposition.jpg not Rplot1.jpg)
  • avoid duplication - as code for a project matures, you will want to start splitting out functions into separate scripts. These scripts might be useful across multiple projects. When reusing a script, use a symbolic link to save space on your computer and avoid having to update a file in multiple places. Data that is reused can also be symbolically linked (ln -s)
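The layout described in these tips could be set up from R as sketched below. The directory names data, src and figures come from these notes; results and the project name are assumptions made for the example, and a temporary directory is used so nothing real is touched:

```r
# Hypothetical project root inside a temporary directory
root <- file.path(tempdir(), "my_project")
dirs <- file.path(root, c("data", "results", "src", "figures"))

# Create each directory (and the root) if it does not exist yet
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

list.files(root)
```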

data for this workshop

following good project management practices, make a new directory called data and download the data we will be playing with in this workshop into that directory:

In terminal tab:

mkdir data

cd data

wget https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv

If wget is not available, curl -O followed by the same URL downloads the file as well.

We will use the data later, but we can get a general sense of the data by looking at it in the terminal, which will help us decide how to load it into R later:

wc -l gapminder_data.csv

head gapminder_data.csv

cd -
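The same quick look can be taken from within R. A sketch on a tiny stand-in file (three lines written to a temporary file purely for illustration; the real file has more columns and rows):

```r
# Write a small stand-in CSV so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop",
             "Afghanistan,1952,8425333",
             "Afghanistan,1957,9240934"), tmp)

n_lines <- length(readLines(tmp))   # comparable to `wc -l`
first   <- readLines(tmp, n = 2)    # comparable to `head -n 2`
n_lines
first
```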

3 GitHub

Go to your GitHub account and make a new repository. DO NOT initialize it with a README.

follow the instructions on the next page

(in terminal tab)

echo "# SkillPill_ReproducibleR" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/maggimars/SkillPill_ReproducibleR.git
git push -u origin master

README.md is a Markdown file; it is like this Rmarkdown file in many ways and uses similar syntax.

try also adding your data directory to your Github repository!

Alternatively - you can use the Rstudio interface to version control with Git https://swcarpentry.github.io/git-novice/14-supplemental-rstudio/

(I prefer command line)

4 A few notes on getting help

?function_name

If you can’t quite remember a function name, search the help with ??function_name

Pro tip: from within the function help page, you can highlight code in the Examples and hit Ctrl+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.

?kable

For special operators use quotes, e.g. ?"<-"

Without any arguments, vignette() will list all vignettes for all installed packages; vignette(package = "package-name") will list all available vignettes for package-name, and vignette("vignette-name") will open the specified vignette.

And then there is always Google.

Reproducible and Streamlined Analyses (Day 2)

Exploring the sample data

We already looked at the sample data in Terminal and saw that it was a .csv file with 1705 lines and that it does have a header.

gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)

View data in another tab with View()

When your data is in a GitHub repo, you can also read it directly from the repo:

library(data.table) # you might need to install this package
gapminder<- fread("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", header = TRUE)

4.1 Functions

reusable! (and therefore reproducible!)

Often start by writing a function within an interactive session.

Let’s write a function that converts Fahrenheit to Celsius (because I am moving back to America and I’m going to need this)

fahr_to_celc <- function(temp) {
  celc <- ((temp - 32) * (5 / 9))
  return(celc)
}

Get body temperature in Celsius (seems to be important these days):

fahr_to_celc(98.6)
## [1] 37

Stopifnot

fahr_to_celc <- function(temp) {
  stopifnot(is.numeric(temp))
  celc <- ((temp - 32) * (5 / 9))
  return(celc)
}

What happens if you call with a number?

fahr_to_celc(100)
## [1] 37.77778

What if you call with a string?

#fahr_to_celc("hot")
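The call is commented out because it stops with an error, which would halt knitting. A sketch of one way to look at the message anyway, using tryCatch() (this wrapper is an addition for illustration, not part of the lesson code):

```r
fahr_to_celc <- function(temp) {
  stopifnot(is.numeric(temp))
  celc <- ((temp - 32) * (5 / 9))
  return(celc)
}

# Capture the error message instead of letting it stop execution
msg <- tryCatch(
  fahr_to_celc("hot"),
  error = function(e) conditionMessage(e)
)
msg
```

The message names the condition that failed, which is what makes stopifnot() useful as an early check on inputs.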

Combining Functions:

Define two functions

  1. Fahrenheit to Celsius
  2. Celsius to Kelvin
#1
fahr_to_celc <- function(temp) {
  stopifnot(is.numeric(temp))
  celc <- ((temp - 32) * (5 / 9))
  return(celc)
}

#2
celc_to_kelv <- function(temp) {
  stopifnot(is.numeric(temp))
  kelv <- ((temp + 273.15))
  return(kelv)
}

Define a new function that calls both these functions to convert fahrenheit to kelvin:

fahr_to_kelv <- function(temp) {
  stopifnot(is.numeric(temp))
  tmp <- fahr_to_celc(temp)
  out <- celc_to_kelv(tmp)
  return(out)
}
fahr_to_kelv(100)
## [1] 310.9278

A more useful example:

Calculate gross domestic product in our data set:

# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat) {
  gdp <- dat$pop * dat$gdpPercap
  return(gdp)
}
calcGDP(head(gapminder))
## [1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231

But that is not super useful. Let’s add more arguments so we can extract per country and per year:

# Takes a dataset and multiplies the population column
# with the GDP per capita column, optionally keeping only
# the requested year(s) and/or country (countries).
calcGDP <- function(dat, year=NULL, country=NULL) {
  # If gapminder was read with fread() it is a data.table; inside
  # dat[...] a bare `year` or `country` then refers to the column,
  # not the argument, so no rows would be filtered out.
  # Converting to a plain data frame avoids this.
  dat <- as.data.frame(dat)
  if(!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country,]
  }
  gdp <- dat$pop * dat$gdpPercap
  new <- cbind(dat, gdp=gdp)
  return(new)
}

The default arguments are NULL, so when no year or country is given, no filtering happens.

# Only the 2007 rows (one per country), with the new gdp column
head(calcGDP(gapminder, year=2007))
# Only the Australia rows (one per survey year), with the new gdp column
calcGDP(gapminder, country="Australia")

Challenge: Test out your GDP function by calculating the GDP for New Zealand in 1987. How does this differ from New Zealand’s GDP in 1952?

Moving functions to R scripts and sourcing those scripts (best practice for project management!)
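A minimal sketch of that workflow: the function definition goes into its own script file, and source() loads it into the session. A temporary file is used here so the example is self-contained; in a project the script would more likely sit in the src directory (an assumed location):

```r
# Write a function definition to a script file
script <- tempfile(fileext = ".R")
writeLines(c(
  "fahr_to_celc <- function(temp) {",
  "  stopifnot(is.numeric(temp))",
  "  (temp - 32) * (5 / 9)",
  "}"
), script)

# source() runs the script, which defines the function in this session
source(script)
boiling <- fahr_to_celc(212)
boiling
```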

4.2 Dplyr

lesson materials: http://swcarpentry.github.io/r-novice-gapminder/13-dplyr/index.html

Cheat Sheet: https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

use “verbs” to wrangle your data

for example: filter()

or mutate():

Getting started with dplyr:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

select()

pipes %>% and filter()
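A small sketch of select(), the pipe, and filter(), run on a four-row excerpt of the gapminder data (Afghanistan 1952/1957 and Zimbabwe 2002/2007) so that it works without the full file; the object names are made up for the example:

```r
library(dplyr)

# Four-row excerpt of gapminder
gap_small <- data.frame(
  country   = c("Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"),
  year      = c(1952, 1957, 2002, 2007),
  pop       = c(8425333, 9240934, 11926563, 12311143),
  continent = c("Asia", "Asia", "Africa", "Africa"),
  lifeExp   = c(28.801, 30.332, 39.989, 43.487),
  gdpPercap = c(779.4453, 820.8530, 672.0386, 469.7093)
)

# select() keeps only the named columns
year_country_gdp <- select(gap_small, year, country, gdpPercap)

# The pipe hands the left-hand result to the next function as its
# first argument, so filter() and select() chain without intermediates
asia_gdp <- gap_small %>%
  filter(continent == "Asia") %>%
  select(year, country, gdpPercap)
asia_gdp
```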

Challenge: Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the African values for lifeExp, country and year, but not for other Continents. How many rows does your dataframe have and why?

Using group_by() and summarize():
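A sketch of the split-apply-combine pattern on the same kind of four-row gapminder excerpt (object and column names on the left-hand side are made up for the example):

```r
library(dplyr)

# Four-row excerpt of gapminder
gap_small <- data.frame(
  country   = c("Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"),
  year      = c(1952, 1957, 2002, 2007),
  continent = c("Asia", "Asia", "Africa", "Africa"),
  gdpPercap = c(779.4453, 820.8530, 672.0386, 469.7093)
)

# group_by() splits the rows by continent;
# summarize() collapses each group to a single row
gdp_by_continent <- gap_small %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))
gdp_by_continent
```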

Challenge:

Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?

using count() and n():
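A sketch of both helpers on a four-row gapminder excerpt: count() tallies rows per group directly, while n() gives the group size inside summarize() (result names are made up for the example):

```r
library(dplyr)

# Four-row excerpt of gapminder
gap_small <- data.frame(
  country   = c("Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"),
  year      = c(1952, 1957, 2002, 2007),
  continent = c("Asia", "Asia", "Africa", "Africa"),
  lifeExp   = c(28.801, 30.332, 39.989, 43.487)
)

# count(): number of observations per continent, in a column called n
obs_per_continent <- gap_small %>%
  count(continent)

# n(): the current group size, usable alongside other summaries
le_summary <- gap_small %>%
  group_by(continent) %>%
  summarize(n_obs = n(), mean_le = mean(lifeExp))
le_summary
```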

Using mutate():

Connecting mutate() with logical filtering: ifelse()
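A sketch of mutate() on its own and combined with ifelse(), again on a four-row gapminder excerpt. The threshold of 30 years and the new column name are arbitrary choices for the illustration:

```r
library(dplyr)

# Four-row excerpt of gapminder
gap_small <- data.frame(
  country   = c("Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"),
  year      = c(1952, 1957, 2002, 2007),
  pop       = c(8425333, 9240934, 11926563, 12311143),
  lifeExp   = c(28.801, 30.332, 39.989, 43.487),
  gdpPercap = c(779.4453, 820.8530, 672.0386, 469.7093)
)

# mutate() adds a new column computed from existing ones
gap_gdp <- gap_small %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)

# ifelse() inside mutate(): keep the value only where the condition
# holds, and fill in NA everywhere else
gap_gdp_le30 <- gap_small %>%
  mutate(gdp_billion = ifelse(lifeExp > 30, gdpPercap * pop / 10^9, NA))
gap_gdp_le30
```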

Some extras:

Good to know: dplyr can be used directly with ggplot2

library(ggplot2)
gapminder %>%
    # Filter countries that start with "A" or "Z"
    filter(substr(country, start = 1, stop = 1) %in% c("A", "Z")) %>%
    # Make the plot
    ggplot(aes(x = year, y = lifeExp, color = continent)) +
    geom_line() +
    facet_wrap( ~ country)

4.3 Tidyr

Lesson Materials: http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html

CheatSheet: https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf

Be aware: gather() and spread() have been superseded by pivot_longer() and pivot_wider()

Extra information:

vignette("pivot")
## starting httpd help server ... done

Long v. Wide

So what is this all about?

Long and wide dataframe layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.

Researchers often want to reshape their dataframes from ‘wide’ to ‘longer’ layouts, or vice-versa. The ‘long’ layout or format is where:

  • each column is a variable
  • each row is an observation

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R’s functions have been designed assuming you have ‘longer’ formatted data. (Especially ggplot!)
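To make the two layouts concrete, here is a sketch with two gapminder rows for Zimbabwe. The object names and the obs_type / obs_value column names are chosen for the example only:

```r
library(tidyr)

# Wider layout: one row per country-year, one column per observed variable
wide <- data.frame(
  country   = c("Zimbabwe", "Zimbabwe"),
  year      = c(2002, 2007),
  pop       = c(11926563, 12311143),
  lifeExp   = c(39.989, 43.487),
  gdpPercap = c(672.0386, 469.7093)
)

# Purely long layout: the ID columns, plus one column naming the
# observed variable and one column holding its value
long <- pivot_longer(
  wide,
  cols      = c(pop, lifeExp, gdpPercap),
  names_to  = "obs_type",
  values_to = "obs_value"
)
long
```

The same six values appear in both objects; only the shape differs.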

library(tidyr)

Question:

Is gapminder a purely long, purely wide, or some intermediate format?

Using pivot:

Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) will never be so well organized. Here let’s start with the wide formatted version of the gapminder dataset.

Challenge:

Download this dataset into your data directory: https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv

Then read the data into R and name the dataframe gap_wide:

or

gap_wide <- fread("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv", header=TRUE)

That gives us this:

and we want to practice pivoting longer with pivot_longer():

gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
    names_to = "obstype_year", values_to = "obs_values"
  )
str(gap_long)
## tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
##  $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
##  $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
##  $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...

can also use “-” syntax!
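A sketch of the “-” syntax on a one-row stand-in for gap_wide (Zimbabwe values only; the real table has many more columns and rows). Instead of listing the columns to pivot, the ID columns are excluded:

```r
library(tidyr)

# One-row stand-in for gap_wide
gap_wide_small <- data.frame(
  continent    = "Africa",
  country      = "Zimbabwe",
  pop_2002     = 11926563,
  pop_2007     = 12311143,
  lifeExp_2002 = 39.989,
  lifeExp_2007 = 43.487
)

# Exclude the ID columns; every remaining column is pivoted
gap_long_small <- pivot_longer(
  gap_wide_small,
  cols      = c(-continent, -country),
  names_to  = "obstype_year",
  values_to = "obs_values"
)
gap_long_small
```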

using separate

gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
gap_long$year <- as.integer(gap_long$year)

Challenge: Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions

Going in the other direction … (time dependent)
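A minimal sketch of the reverse step with pivot_wider(), using a small hand-built long table (Zimbabwe values from gapminder; object and column names are chosen for the example):

```r
library(tidyr)

# Long layout: one row per country, year and observation type
long <- data.frame(
  country   = rep("Zimbabwe", 4),
  year      = c(2002, 2002, 2007, 2007),
  obs_type  = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_value = c(11926563, 39.989, 12311143, 43.487)
)

# Each distinct obs_type becomes its own column again
wide <- pivot_wider(long, names_from = obs_type, values_from = obs_value)
wide
```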

5 Ggplot!